Systems and Methods for Distributed Generation of Decision Tree-Based Models

ABSTRACT

The present disclosure provides systems and methods to generate exact decision tree-based models (e.g., Random Forest models) in a distributed manner on very large datasets. In particular, the present disclosure provides an exact distributed algorithm to train Random Forest models as well as other decision forest models without relying on approximating best split search.

PRIORITY CLAIM

The present application is based on and claims priority to U.S. Provisional Application No. 62/628,608 having a filing date of Feb. 9, 2018. Applicant claims priority to and the benefit of U.S. Provisional Application No. 62/628,608 and incorporates U.S. Provisional Application No. 62/628,608 herein by reference in its entirety.

FIELD

The present disclosure relates generally to machine learning. More particularly, the present disclosure relates to systems and methods to generate exact decision tree-based models (e.g., Random Forest models) in a distributed manner on very large datasets.

BACKGROUND

Classification and regression problems can include predicting, respectively, the class or the numerical label of an observation using a collection of labelled training records. Decision Tree (DT) learning algorithms are a widely studied family of methods for both classification and regression. DTs have great expressive power (DTs are universal approximators), they are fast to build, and they are highly interpretable. However, controlling DT overfitting is non-trivial.

DT bagging, DT gradient-boosting, and DT boosting are three successful solutions aimed at tackling the DT overfitting problem. These methods (which can be collectively referred to as Decision Forest (DF) methods) can include training collections of DTs. DF methods are state of the art for many classification and regression problems.

Like DT learning algorithms, generic DF methods typically require random memory access to the dataset during training. These methods are also not directly computationally distributable: the cost of network communication exceeds the gain of distribution. These two constraints restrict the usage of existing DF methods to datasets fitting in the main memory of a single computer.

Two families of approaches have been studied and sometimes combined to tackle the problem of training Decision Trees (DT) and Decision Forests (DF) on large datasets: (i) approximating the building of the tree by using a subset of the dataset and/or approximating the computation of the optimal splits with a cheaper or more easily distributable computation, and (ii) using different but exact algorithms (building the same models) that allow distributing the dataset and the computation. Various works have shown that (i) typically leads to bigger forests and lower precision.

SUMMARY

Aspects and advantages of embodiments of the present disclosure will be set forth in part in the following description, or can be learned from the description, or can be learned through practice of the embodiments.

One aspect of the present disclosure is directed to a computer-implemented method. The method includes distributing a training dataset to a plurality of workers on a per-attribute basis, such that each worker receives attribute data associated with one or more attributes. The method includes generating one or more decision trees on a depth level-per-depth level basis. Generating the one or more decision trees includes performing, by each worker at each of one or more depth levels, only a single pass over its corresponding attribute data to generate a plurality of proposed splits of the attribute data respectively for a plurality of live nodes.

Another aspect of the present disclosure is directed to a computer-implemented method. The method includes obtaining, by one or more computing devices, a training dataset comprising data descriptive of a plurality of samples, respective attribute values for a plurality of attributes for each of the plurality of samples, and a plurality of labels respectively associated with the plurality of samples. The method includes partitioning, by the one or more computing devices, the plurality of attributes into a plurality of attribute subsets. Each attribute subset includes one or more of the plurality of attributes. The method includes respectively assigning, by the one or more computing devices, the plurality of attribute subsets to a plurality of workers. The method includes, for each of a plurality of depth levels of a decision tree except a final level, where each depth level includes one or more nodes, and for each of two or more of the plurality of attributes and in parallel: assessing, by the corresponding worker, the attribute value for each sample to update a respective counter associated with a respective node with which such sample is associated, wherein one or more counters are respectively associated with the one or more nodes at a current depth level; and identifying, by the corresponding worker, one or more proposed splits for the attribute respectively for the one or more nodes at the current depth level respectively based at least in part on the one or more counters respectively associated with the one or more nodes at the current depth level. The method includes, for each of the plurality of depth levels of the decision tree except the final level, selecting, by the one or more computing devices, one or more final splits respectively for the one or more nodes at the current depth level from the one or more proposed splits identified by the plurality of workers.

Another aspect of the present disclosure is directed to a computer-implemented method. The method includes generating, by one or more computing devices, a decision tree with only a root. The method includes initializing, by the one or more computing devices, a mapping from a sample index to a node index. The method includes, for each of a plurality of iterations, receiving, by the one or more computing devices, a plurality of proposed splits from a plurality of splitters. The plurality of proposed splits is respectively generated based on a plurality of attributes of a training dataset. The method includes, for each of the plurality of iterations, selecting, by the one or more computing devices, a final split from the plurality of proposed splits. The method includes, for each of the plurality of iterations, updating, by the one or more computing devices, a node structure of the decision tree based at least in part on the selected final split. The method includes, for each of the plurality of iterations, updating, by the one or more computing devices, the mapping from the sample index to the node index based at least in part on the selected final split and the updated node structure. The method includes, for each of the plurality of iterations, broadcasting, by the one or more computing devices, the updated mapping to the plurality of splitters.

Another aspect of the present disclosure is directed to a computing system that includes one or more computing devices. The one or more computing devices are configured to implement: a manager computing machine; and a plurality of worker computing machines coordinated by the manager computing machine. The plurality of worker computing machines includes a plurality of splitter worker computing machines that have access to respective subsets of columns of a training dataset. Each of the splitter worker computing machines is configured to identify one or more proposed splits respectively for one or more attributes to which such splitter worker computing machine has access. The plurality of worker computing machines includes one or more tree builder worker computing machines respectively associated with one or more decision trees. Each of the one or more tree builder worker computing machines is configured to select a final split from the plurality of proposed splits identified by the plurality of splitter worker computing machines.

Another aspect of the present disclosure is directed to a computer-implemented method. The method includes obtaining, by one or more computing devices, a training dataset comprising data descriptive of a plurality of samples, respective attribute values for a plurality of attributes for each of the plurality of samples, and a plurality of labels respectively associated with the plurality of samples. The method includes partitioning, by the one or more computing devices, the plurality of attributes into a plurality of attribute subsets, each attribute subset comprising one or more of the plurality of attributes. The method includes respectively assigning, by the one or more computing devices, the plurality of attribute subsets to a plurality of workers. The method includes, for each of a plurality of depth levels of a decision tree except an initial level and a final level, each depth level comprising a plurality of live nodes: for each of two or more of the plurality of attributes and in parallel: assessing, by the corresponding worker, the attribute value for each sample to update a respective counter associated with a respective node with which such sample is associated, wherein a plurality of counters are respectively associated with the plurality of live nodes at a current depth level; and identifying, by the corresponding worker, a plurality of proposed splits for the attribute respectively for the plurality of live nodes at the current depth level respectively based at least in part on the plurality of counters respectively associated with the plurality of live nodes at the current depth level. The method includes, for each of the plurality of depth levels of the decision tree except the initial level and the final level: selecting, by the one or more computing devices, a plurality of final splits respectively for the plurality of live nodes at the current depth level from the plurality of proposed splits identified by the plurality of workers.

Other aspects of the present disclosure are directed to various systems, apparatuses, non-transitory computer-readable media, user interfaces, and electronic devices.

These and other features, aspects, and advantages of various embodiments of the present disclosure will become better understood with reference to the following description and appended claims. The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate example embodiments of the present disclosure and, together with the description, serve to explain the related principles.

BRIEF DESCRIPTION OF THE DRAWINGS

Detailed discussion of embodiments directed to one of ordinary skill in the art is set forth in the specification, which makes reference to the appended figures, in which:

FIG. 1 depicts a block diagram of an example computing system according to example embodiments of the present disclosure.

FIG. 2 depicts a block diagram of an example computing system according to example embodiments of the present disclosure.

FIG. 3A depicts a block diagram of an example computing system according to example embodiments of the present disclosure.

FIG. 3B depicts a block diagram of an example computing device according to example embodiments of the present disclosure.

FIG. 3C depicts a block diagram of an example computing device according to example embodiments of the present disclosure.

Reference numerals that are repeated across plural figures are intended to identify the same features in various implementations.

DETAILED DESCRIPTION

1. Introduction

Example aspects of the present disclosure are directed to systems and methods to generate exact decision tree-based models (e.g., Random Forest models) in a distributed manner on very large datasets. In particular, the present disclosure provides an exact distributed algorithm to train Random Forest models as well as other decision forest models without relying on approximating best split search.

More particularly, two families of approaches have been studied and sometimes combined to tackle the problem of training Decision Trees (DT) and Decision Forests (DF) on large datasets: (i) approximating the building of the tree by using a subset of the dataset and/or approximating the computation of the optimal splits with a cheaper or more easily distributable computation, and (ii) using different but exact algorithms (i.e., algorithms that ultimately result in the same models) that allow distributing the dataset and the computation. Various works have shown that (i) typically leads to bigger forests and lower precision. The present disclosure focuses on the latter family of approaches: the present disclosure provides distributed systems and methods which result in models that are equivalent to those that would be obtained through performance of the original DT algorithm. However, the distributed nature of the systems and methods described herein allows them to be applied to extremely large datasets, which is not possible for the original DT or related algorithms.

According to an aspect of the present disclosure, a massive dataset can be distributed to a number of distributed and parallel workers. In particular, a computing system can distribute a training dataset to a plurality of workers on a per-attribute basis. For example, the training dataset can include data descriptive of a plurality of samples (e.g., organized into rows: one sample per row), respective attribute values for a plurality of attributes for each of the plurality of samples (e.g., organized into columns: one attribute per column, with each row of the column providing an attribute value for the corresponding sample), and a plurality of labels respectively associated with the plurality of samples (e.g., a final column that contains the labels for the samples). The computing system can partition and distribute the training dataset such that each worker receives attribute data associated with one or more attributes.
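As an illustrative sketch only, the following Python snippet shows one hypothetical way the per-attribute (column-wise) partition could be expressed. The round-robin assignment and the worker count are assumptions for the example, not a scheme prescribed by the present disclosure.

def partition_columns(num_attributes, num_workers):
    # Assign each attribute (column) index to a worker, round-robin.
    subsets = [[] for _ in range(num_workers)]
    for j in range(num_attributes):
        subsets[j % num_workers].append(j)
    return subsets

# Example: 10 attribute columns split among 3 splitter workers. Each worker
# then only ever reads its own columns, sequentially.
assignment = partition_columns(10, 3)
# assignment == [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]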

According to another aspect of the present disclosure, the computing system can generate one or more decision trees on a depth level-per-depth level basis. In some implementations, the computing system can generate (e.g., on a depth level-per-depth level basis) multiple decision trees in parallel. In other implementations, the computing system can sequentially generate multiple decision trees (e.g., one after the other). In yet further implementations, the computing system can generate only a single, stand-alone decision tree.

As one example technique for generating one or more decision trees on a depth level-per-depth level basis, the computing system can perform an iterative process to determine optimal splits of the nodes of the one or more trees at a current depth level, and then iteratively proceed to the next depth level. In particular, at each depth level, the workers can assess their respective attribute(s) and determine a proposed split for each attribute and for each live node at the current depth level. One or more tree builders responsible for building the one or more decision trees can receive the proposed splits from the workers and select a final, optimal split for each of the live nodes from the respective splits proposed for the nodes by the workers.

More particularly, according to another aspect of the present disclosure, at each depth level, each worker can perform only a single pass over its corresponding attribute data to generate a proposed split of its corresponding attribute data for each of a plurality of different nodes. Thus, during its single pass over its corresponding attribute data, each worker can generate proposed splits of the attribute data respectively for some or all of the live nodes at a current depth level. This is in contrast to certain existing techniques (e.g., the original DT algorithm), where a separate container of training data is generated for each node and the algorithm separately analyzes the data included in each container. This is also in contrast to certain existing techniques (e.g., SLIQ/R) which perform multiple passes over the attribute data on a node-by-node basis, rather than a single pass for all nodes.

In some implementations, a single worker can generate a respective proposed split for an attribute for live nodes across multiple trees. That is, in some implementations in which multiple trees are generated in parallel, a single worker can generate a proposed split of its attribute(s) for all live nodes at a current depth level in all trees (or a subset of all trees that includes two or more trees). In other implementations, each worker can generate a respective proposed split for an attribute for all live nodes in just a single tree. That is, in some implementations in which multiple trees are generated in parallel, a single worker can be assigned to each combination of tree and attribute and can generate respective proposed splits for the live nodes at the current depth level within its assigned tree. Thus, workers can be replicated in parallel and assigned to the same set of one or more attribute(s) but different trees to respectively generate proposed splits for such attribute(s) for multiple trees being generated in parallel. Other divisions of responsibility can be used as well. For example, a worker can work on several trees independently of each other.

As one example technique to generate proposed splits, at each depth level, each worker can determine whether each sample included in the training dataset is associated with one or more live nodes at the depth level. For example, in some implementations, each worker can use a shared seed-based bagging technique to compute a number of instances that a particular sample is included in a tree-specific training dataset associated with a given decision tree. Additionally or alternatively, the worker can consult a sample to node mapping to determine whether a sample is associated with a particular node.

For each sample associated with one or more live nodes of a current depth level, each worker can update one or more counters respectively associated with the one or more live nodes with which such sample is associated. In particular, the worker can update each counter based on the sample's attribute value(s) respectively associated with the attribute(s) associated with such worker.

As one example, for categorical attributes, each worker can update, for each live node, one or more bi-variate histograms between label values and attribute values respectively included in the one or more attributes associated with such worker.

As another example, for numerical attributes, each worker can sequentially and iteratively score, for each live node, proposed numerical splits of the attribute values respectively included in the one or more attributes associated with such worker.

After updating the respective counter(s) for its attribute(s) for each live node, each worker can generate a proposed split for each of the one or more live nodes at the depth level based on the counters. For example, the proposed split can be identified based on the final counter values.

At each depth level, one or more tree builders responsible for building the one or more decision trees can receive the proposed splits from the workers and select a respective final split for each of the live nodes. The tree builders can effectuate the selected final splits (e.g., generate children nodes for one or more of the live nodes and update the sample to node mapping based on the selected final split(s)), thereby generating a new depth level for the decision trees and restarting the iterative level building process. In some implementations, the updated sample to node mapping can be broadcasted to all of the splitter workers.

According to another aspect of the present disclosure, in some implementations, the sample to node mapping can be wholly stored in volatile memory (e.g., random access memory). In other implementations, the sample to node mapping can be distributed into a number of chunks and one or more of the chunks (e.g., the chunk currently being used by the worker(s)) can be stored in volatile memory while the other chunks (e.g., those not currently being used) can be stored in non-volatile memory (e.g., a disk drive). Thus, only a part of the mapping needs to reside in volatile memory at any instant, which advantageously provides lower volatile memory usage.
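The following Python sketch illustrates the chunked variant of the sample to node mapping under stated assumptions: it keeps a single active chunk in volatile memory and spills the remaining chunks to disk files, and it stores one 32-bit integer per sample for simplicity rather than the bit-packed encoding discussed in section 2.4 below.

import array
import os
import tempfile

class ChunkedSampleToNodeMap:
    # Sample-to-node mapping split into fixed-size chunks; only the active
    # chunk resides in RAM, the others live in temporary files on disk.
    def __init__(self, num_samples, chunk_size):
        self.chunk_size = chunk_size
        self.num_chunks = (num_samples + chunk_size - 1) // chunk_size
        self.dir = tempfile.mkdtemp()
        for k in range(self.num_chunks):
            with open(self._path(k), 'wb') as f:
                array.array('i', [0] * chunk_size).tofile(f)  # all samples start at root 0
        self.active_id = 0
        self.active = array.array('i', [0] * chunk_size)

    def _path(self, k):
        return os.path.join(self.dir, 'chunk_%d.bin' % k)

    def _load(self, k):
        if k != self.active_id:
            with open(self._path(self.active_id), 'wb') as f:
                self.active.tofile(f)                 # spill the current chunk
            self.active = array.array('i')
            with open(self._path(k), 'rb') as f:
                self.active.fromfile(f, self.chunk_size)
            self.active_id = k

    def get(self, i):
        self._load(i // self.chunk_size)
        return self.active[i % self.chunk_size]

    def set(self, i, node):
        self._load(i // self.chunk_size)
        self.active[i % self.chunk_size] = node

Sequential access patterns, as used by the splitters, touch each chunk once per pass, which is consistent with the time complexity growing only linearly with the number of chunks.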

In the following sections, the present disclosure explains example systems, methods, and algorithmic implementations of the concepts described herein in further detail. In particular, among other examples, the present disclosure provides a distributed and exact implementation of Random Forest able to train on datasets larger than in any such past work, which can in some instances be referred to as “Distributed Random Forest” (DRF).

The methods described herein stand out from existing exact distributed approaches by a smaller space, disk and network complexity. In particular, various implementations of the present disclosure can provide the following benefits: (1) Removal of the random access memory requirement; (2) Distributed training (distribution even of a single tree); (3) Distribution of the training dataset (i.e. no worker requires access to the entire dataset); (4) Minimal number of passes in terms of reading/writing on disk and network communication; and/or (5) Distributed computing of feature importance.

U.S. Provisional Application No. 62/628,608, which is incorporated herein by reference, compares example implementations of the present disclosure to related approaches for various complexity measures (time, RAM, disk, and network complexity analysis). Further, U.S. Provisional Application No. 62/628,608 reports their running performances on artificial and real-world datasets of up to 18 billion examples. This figure is several orders of magnitude larger than datasets tackled in the existing literature. U.S. Provisional Application No. 62/628,608 also empirically shows that Random Forest benefits from being trained on more data, even in the case of already gigantic datasets.

2. Example Distributed Random Forest Technique

This section describes a proposed Distributed Random Forest algorithm (DRF). The structure of the DRF algorithm is different from the classical recursive Random Forest algorithm; nonetheless, the proposed algorithm is guaranteed to produce the same model as RF.

The proposed method aims to reach: (1) Removal of the random access memory requirement. (2) Distributed training (distribution even of a single tree). (3) Distribution of the training dataset (i.e. no worker requires access to the entire dataset). (4) Minimal number of passes in terms of reading/writing on disk and network communication. (5) Distributed computing of feature importance. While the present disclosure mainly focuses on Random Forests, the proposed algorithm can be applied to other DF models, notably Gradient Boosted Trees (Ye et al., 2009).

Throughout this section, the DRF algorithm is generally compared to two existing methods that fall in the same category: Sprint (Shafer et al., 1996) and distributed versions of Sliq (Mehta et al., 1996).

DRF computation can be distributed among computing machines called “workers”, and coordinated by a “manager”. The manager and the workers can communicate through a network. DRF is relatively insensitive to the latency of communication (see, e.g., network complexity analysis in U.S. Provisional Application No. 62/628,608).

DRF also distributes the dataset between workers: each worker is assigned a subset of columns (most often) or sometimes a subset of rows (for evaluators or if sharding is added) of the dataset. Each worker only needs to read its assigned part of the dataset sequentially. Thus, according to an aspect of the present disclosure, no random access and no writing are needed. Workers can be configured to load the dataset in memory, or to access the dataset on drive/through network access.

Finally, each worker can host a certain number of threads. While workers communicate with each other through a network (with potentially high latency), it is assumed that the threads of a given worker have access to a shared bank of memory. Most of the steps that compose DRF can be multithreaded.

Several types of workers are responsible for different operations. The splitter workers look for optimal candidate splits. Each splitter has access to a subset of dataset columns. The tree builder workers hold the structure of one DT being trained (one DT per tree builder) and coordinate the work of the splitters. Tree builders do not have access to the dataset. One tree builder can control several splitters, and one splitter can be controlled by several tree builders.

The OOB evaluator workers continuously evaluate the out-of-bag (OOB) error of the entire forest trained so far. Each evaluator has access to a subset of the dataset rows.

The manager manages the tree builders and the evaluators. The manager is responsible for the fully trained trees. The manager does not have access to the dataset.

Unlike the generic DT learning algorithm, DRF builds DTs “depth level by depth level.” That is, all the nodes at a given depth are trained together. The training of a single tree is distributed among the workers. Additionally, as the trees of a Random Forest are independent, DRF can train all the trees in parallel. DRF can also be used to train co-dependent sets of trees (e.g. Boosted Decision Trees). In this case, while trees cannot be trained in parallel, the training of each individual tree is still distributed.

The following subsections provide descriptions of example implementations of, and pseudocode for, the DRF concepts.

2.1 Example Dataset Preparation

Presorting can be performed for numerical attributes.

The first stage of the algorithm includes preparing the training set D = {(x_(i,j), y_(i)); i = 1, . . . , n; j = 1, . . . , m}, where n is the number of samples and m is the number of columns (also called attributes or features).

First, a unique dense integer index can be computed for each sample. If available, this index is simply the index i of the sample in the dataset. Next, the dataset can be re-ordered column-wise in increasing order of the sample indexes, and each column can be divided into p shards: for each column, the shard k contains the samples i ∈ [t_(k); t_(k+1)] with t_(p+1) = n. Finally, each numerical column can be sorted by increasing attribute value.

A sorted column can be a list of tuples <attribute value, label value, sample index, (optionally) sample weight>. The most expensive operation when preparing the dataset is the sorting of the numerical attributes. In case of large datasets, this operation can be done using external sorting.
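A minimal Python sketch of this preparation for a single in-memory numerical column follows; external sorting for out-of-core columns and the optional sample weight field are omitted.

def prepare_numerical_column(values, labels):
    # Build the <attribute value, label value, sample index> tuples and sort
    # once, up front; splitters never need to re-sort during training.
    column = [(v, labels[i], i) for i, v in enumerate(values)]
    column.sort(key=lambda t: t[0])
    return column

values = [3.2, 1.5, 2.7, 0.9]
labels = [1, 0, 1, 0]
sorted_column = prepare_numerical_column(values, labels)
# -> [(0.9, 0, 3), (1.5, 0, 1), (2.7, 1, 2), (3.2, 1, 0)]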

2.2 Example Dataset Distribution

In this phase, the manager can distribute the dataset among the splitters and the evaluator workers. Each splitter can be assigned a subset of the dataset columns, and each evaluator can be assigned a subset of the dataset shards. In case several DTs are trained in parallel (e.g. RF), DRF benefits from having workers replicated, i.e. several workers own the same part of the dataset and are able to perform the same computation.

2.3 Example Seeding

RF “bags” samples (i.e. sampling with replacement, n out of n records) used to build each tree. Instead of sending indices over the network, DRF can use a deterministic pseudorandom generator so that all workers agree on the set of bagged examples without network communication.

More particularly, for each tree, each sample i is selected b_(i) times, with b_(i) sampled from the Binomial distribution corresponding to n trials with success probability 1/n. Pre-computing and storing b_(i) for each example is prohibitively expensive for large datasets.

Instead, in some implementations, DRF can compute b_(i) on the fly using a fast pseudorandom generator function: b_(i) = bag(i, p), with i the sample index and p the tree index. bag(i, p) is a deterministic function. DRF can use an implementation of bag(i, p), for example, as proposed in Algorithm 6. This algorithm is a fixed number of steps of a linear congruential generator that uses i and p as seeds. This implementation is a low quality random generator, but it is fast and sufficient for the bagging task.

Algorithm 6 Computation of bag(i, p)
a, b and m are three fixed large prime numbers, and n an integer (e.g. n = 3). cdf(k) is the cumulative distribution of the Binomial with n trials and success probability 1/n; the cdf(k) values are pre-computed for k ∈ [0, K] (e.g. K = 10).
  c ← i
  for k ← 0, . . . , n do c ← (ac + b) % m
  c ← c + p
  for k ← 0, . . . , n do c ← (ac + b) % m
  v ← c/m
  for all k ← 0, . . . , K do
    if v ≤ cdf(k) then return k
  end for
  return K + 1
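A hedged Python rendering of Algorithm 6 is sketched below. The primes A, B and M, the number of LCG steps and the cap K are illustrative choices standing in for the fixed constants of the listing, and the Binomial cdf is pre-computed as described above.

import math

A, B = 2147483647, 1000000007       # illustrative large primes
M = (1 << 61) - 1                   # a Mersenne prime
N_STEPS, K = 3, 10

def binomial_cdf(k, n, p):
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

def make_bag(num_samples):
    # cdf(k) of the Binomial with num_samples trials, success probability 1/num_samples.
    cdf = [binomial_cdf(k, num_samples, 1.0 / num_samples) for k in range(K + 1)]
    def bag(i, p):
        # Deterministically returns how many times sample i is bagged in tree p.
        c = i
        for _ in range(N_STEPS):
            c = (A * c + B) % M
        c = c + p
        for _ in range(N_STEPS):
            c = (A * c + B) % M
        v = c / M
        for k in range(K + 1):
            if v <= cdf[k]:
                return k
        return K + 1
    return bag

bag = make_bag(num_samples=1_000_000)
count = bag(42, 7)   # every worker computes the same value, without communication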

With this method, all workers are aware of the selected samples, without the cost of transmitting or storing this information. The random-access property removes the need for storing the samples in memory.

Similarly, Random Forest requires selecting a random subset of candidate attributes to evaluate at each node of each tree. Following the same method, DRF uses the deterministic function candidate(j, h, p), where candidate(j, h, p) specifies if the attribute j is considered for the node h of the tree p, and with candidate(·, ·, ·) following a binary distribution with success probability 1/√d.
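By way of illustration, the deterministic attribute-sampling function can be sketched as follows; the hash-based construction is an assumption, as the disclosure only requires a deterministic function shared by all workers with the stated success probability.

import hashlib
import math

def candidate(j, h, p, d):
    # True iff attribute j is a candidate for node h of tree p, with
    # success probability 1/sqrt(d) over the d attributes.
    digest = hashlib.sha256(('%d:%d:%d' % (j, h, p)).encode()).digest()
    u = int.from_bytes(digest[:8], 'big') / 2.0**64   # uniform in [0, 1)
    return u < 1.0 / math.sqrt(d)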

2.4 Example Mapping of Sample Indices to Node Indices

At any point during training, each bagged sample is attached to a single leaf—initially the root node. When a leaf is derived into two children, each sample of this node is re-assigned to one of its child nodes according to the result of the node condition (condition=chosen split). In some implementations, DRF splitters and tree builders need to represent the mapping from a sample index to a leaf node.

DRF monitors the number l of active leaves (i.e., the number of leaf nodes which can be further derived). Therefore, ⌈log₂ l⌉ bits of information are needed to index a leaf. If there is at least one non-active leaf, ⌈log₂(l+1)⌉ bits are needed to encode the case of a sample being in a closed leaf. Therefore, this mapping requires n⌈log₂(l+1)⌉ bits of memory to store in which leaf each sample is.
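As a worked example of this bound (an illustrative calculation, not a figure from the disclosure), consider n = 10⁹ samples and l = 1000 active leaves:

import math

n, l = 10**9, 1000
bits = n * math.ceil(math.log2(l + 1))   # 10 bits per sample
print(bits / 8 / 2**30)                  # ~1.16 GiB for the whole mapping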

Depending on the size of the dataset, this mapping can either be stored entirely in memory, or the mapping can be distributed among several chunks such that only one chunk is in memory at any time. The time complexity of DRF essentially increases linearly with the number of chunks for this mapping.

Unlike Sliq, DRF does not need to store the label values in memory.

2.5 Example Techniques for Finding the Best Split

During training, each splitter searches for the optimal split among the candidate attributes it owns. The final optimal split is the best optimal split among all the splitters. The optimal split is defined as the split with the highest split score. As examples, either the Information Gain or the Gini Index can be used as split scores.

A split is defined as a column index j and a condition over the values of this column. For numerical columns, the condition is of the form x_(i,j) ≤ τ with τ ∈ ℝ. For categorical columns, the condition is of the form x_(i,j) ∈ C with C ∈ 2^(S_(j)) and S_(j) the support of column j. In case of attribute sampling (e.g. RF), only a random subset of attributes is considered. The super split can refer to a set of splits mapped one-to-one with the open leaves at a given depth of a tree.

The following subsections present examples of how DRF can compute the optimal splits for all the nodes at a given depth, i.e. the optimal super split at a given depth, in a single pass per feature. Computing optimal splits on categorical attributes is easily parallelized, whereas computing optimal splits in the case of numerical attributes needs presorting. These two cases are now discussed.

2.5.1 Categorical Attributes

Estimating the best condition for a categorical attribute j in leaf h can include computing the bi-variate histogram between the attribute values and the label values for all the samples in h. The optimal (in case of binary labels) or approximate (in case of multiclass labels) split can then be identified using any number of techniques (see, e.g., L. Breiman et al., Classification and Regression Trees. Chapman & Hall, New York, 1984).

For a given categorical attribute j, given the mapping from the sample index to the open leaf index, a splitter computes this bi-histogram for each of the open leaves through a single sequential iteration over the records of the attribute j.

An example listing is given in Algorithm 7. The iteration over the samples can be trivially parallelized (multithreading over sharding).

Algorithm 7 Find the best supersplits for categorical attribute j and tree p.
Nodes are open when they are still subject to splitting - typically nodes are closed when they reach some purity level or when their cardinal is below some threshold. H_(h∈[1,l]) is an empty bi-histogram between the labels and the attribute j for the leaf l.
  for all i in 1, . . . , n do    // This loop can be parallelized
    h ← sample2node(i)
    if h is a closed node then continue
    if candidate feature(j,h,p) is false then continue
    B ← bag(i,p)    // Number of times i is sampled in tree p
    if B = 0 then continue
    Add (x_(i,j), y_(i)) weighted by B to H_(h)
  end for
  for all open leaf h do
    Find best condition using bi-histogram H_(h)
  end for
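A hedged Python sketch of Algorithm 7 follows. The helpers sample2node, bag and candidate correspond to the functions described above; the one-value-versus-rest Gini scorer is a simplification standing in for the exact or approximate condition search cited from Breiman et al., and all parameter names are illustrative.

from collections import defaultdict

def gini(counts):
    n = sum(counts.values())
    return 1.0 - sum((c / n) ** 2 for c in counts.values()) if n else 0.0

def best_categorical_condition(H):
    # One-value-vs-rest Gini gain over the bi-histogram H[(value, label)] -> weight.
    per_value = defaultdict(lambda: defaultdict(float))
    total = defaultdict(float)
    for (v, label), w in H.items():
        per_value[v][label] += w
        total[label] += w
    n, parent, best = sum(total.values()), gini(total), None
    for v, inside in per_value.items():
        outside = {lab: total[lab] - inside.get(lab, 0.0) for lab in total}
        n_in = sum(inside.values())
        gain = parent - (n_in / n) * gini(inside) - ((n - n_in) / n) * gini(outside)
        if best is None or gain > best[0]:
            best = (gain, v)
    return best   # (split score, attribute value defining the condition)

def categorical_supersplits(j, p, x, y, sample2node, bag, candidate, closed, d):
    # Single sequential pass over attribute j, filling one bi-histogram per open leaf.
    hist = defaultdict(lambda: defaultdict(float))
    for i in range(len(x)):
        h = sample2node(i)
        if h in closed or not candidate(j, h, p, d):
            continue
        b = bag(i, p)             # number of times sample i is bagged in tree p
        if b == 0:
            continue
        hist[h][(x[i], y[i])] += b
    return {h: best_categorical_condition(H) for h, H in hist.items()}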

2.5.2 Numerical Attributes

Estimating the exact best threshold for a numerical attribute can include performing a sequential iteration over all the samples in increasing order of the attribute values.

Suppose q(k,j,h) is the sample index of the kth element sorted according to the attribute j in the node h, i.e. x_(q(0,j,h),j) ≤ x_(q(1,j,h),j) ≤ . . . ≤ x_(q(n_(h)−1,j,h),j). During this iteration, the average of each two successive attribute values, (x_(q(k,j,h),j) + x_(q(k+1,j,h),j))/2, is a candidate value for τ. The score of each candidate can be computed from the label values of the already traversed samples and the label values of the remaining samples.

For a given numerical attribute j, given the mapping from the sample index to the open leaf index, a splitter estimates the optimal threshold for each of the open leaves through a single sequential iteration over the records ordered according to the values of the attribute j. Since the records are already sorted by attribute values (see, e.g., section 2.1), no sorting is required for this step. One example listing is given in Algorithm 8.

Algorithm 8 Find the best supersplits for numerical attribute j and tree p
H_(h∈[1,l]) will be the histograms of the already traversed labels for the leaf l (initially empty). v_(h∈[1,l]) is the last tested threshold (initially null) for the leaf l. q(j) is the list of records sorted according to the attribute j, i.e. q(j) is a list of tuples (a,b,i), sorted in increasing order of a, where a is the numerical attribute value, b is the label value, and i is the sample index. {t_(l)} will be the best τ for leaf l (initially null). {s_(l)} will be the score of t_(l) (initially null).
  for all (a,b,i) in q(j) do
    h ← sample2node(i)
    if h is a closed node then continue
    if candidate feature(j,h,p) is false then continue
    B ← bag(i,p)
    if B = 0 then continue
    τ ← (a + v_(h))/2
    s′ ← the score of τ (computed using H_(h))
    if s′ > s_(h) then
      s_(h) ← s′
      t_(h) ← τ
    end if
    Add y_(i) weighted by B to H_(h) for label b
    v_(h) ← a
  end for
  return {t_(l)} and {s_(l)}
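Similarly, a hedged Python sketch of Algorithm 8 is given below. sorted_column is the presorted list of (value, label, index) tuples from section 2.1, and score is an assumed callback (e.g., Information Gain or the Gini Index) evaluated from the per-leaf histogram of already traversed labels.

from collections import defaultdict

def numerical_supersplits(j, p, sorted_column, sample2node, bag, candidate,
                          closed, d, score):
    H = defaultdict(lambda: defaultdict(float))   # labels traversed so far, per leaf
    v, t, s = {}, {}, {}                          # last value, best tau, best score
    for a, b_label, i in sorted_column:           # single pass, ascending values
        h = sample2node(i)
        if h in closed or not candidate(j, h, p, d):
            continue
        B = bag(i, p)
        if B == 0:
            continue
        if h in v and v[h] < a:                   # a strict midpoint exists
            tau = (a + v[h]) / 2
            s_new = score(H[h], h)
            if h not in s or s_new > s[h]:
                s[h], t[h] = s_new, tau
        H[h][b_label] += B
        v[h] = a
    return t, s                                   # best threshold and score per leaf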

2.6 Example Technique for Training a Decision Tree

Each decision tree can be built by a tree builder. For example, Algorithm 9 provides one example technique for building a decision tree.

Algorithm 9 Tree builder algorithm for DRF.
1: Create a decision tree with only a root. Initially, the root is the only open leaf.
2: Initialize the mapping from sample index to node index so that all samples are assigned to the root.
3: Query the splitters for the optimal supersplit. Each splitter returns a partial optimal supersplit computed only from the columns it has access to (using Alg. 8 in the case of numerical splits). The (global) optimal supersplit is chosen by the tree builder by comparing the answers of the splitters.
4: Update the tree structure with the optimal supersplit.
5: Query the splitters for the evaluation of all the conditions in the best supersplit. Each splitter only evaluates the conditions it has found (if any). Each splitter sends the results to the tree builder as a dense bitmap. In total, all the splitters are sending one bit of information for each sample selected at least once in the bagging and still in an open leaf.
6: Compute the number of active leaves and update the mapping from sample index to node index.
7: Broadcast the evaluation of conditions to all the splitters so they can also update their sample index to node index mapping.
8: Close leaves with not enough records or no good conditions.
9: If at least one leaf remains open, go to step 3.
10: Send the DT to the manager.
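For orientation only, the control flow of Algorithm 9 might be simulated in a single process as sketched below. The splitter proxy objects, their propose and evaluate methods, and the use of a positive score as the closing criterion are all hypothetical stand-ins for the networked queries and bitmap broadcasts described in steps 3 through 7.

class ProposedSplit:
    # Hypothetical container: score, the splitter that found the split, and
    # an opaque condition that only this splitter can evaluate.
    def __init__(self, score, owner, condition):
        self.score, self.owner, self.condition = score, owner, condition

def build_tree(splitters, num_samples):
    sample2node = [0] * num_samples               # step 2: everything at the root
    open_leaves, next_node, tree = {0}, 1, {}
    while open_leaves:                            # step 9
        supersplit = {}                           # steps 3-4: best split per leaf
        for sp in splitters:
            for h, split in sp.propose(sample2node, open_leaves).items():
                if h not in supersplit or split.score > supersplit[h].score:
                    supersplit[h] = split
        new_open = set()
        for h in open_leaves:
            split = supersplit.get(h)
            if split is None or split.score <= 0:
                continue                          # step 8: close this leaf
            # Step 5: the winning splitter evaluates its condition, returning
            # a bitmap indexed by sample id.
            bitmap = split.owner.evaluate(split.condition)
            left, right = next_node, next_node + 1
            for i in range(num_samples):          # steps 6-7: reroute samples
                if sample2node[i] == h:
                    sample2node[i] = left if bitmap[i] else right
            tree[h] = (split.condition, left, right)
            new_open.update((left, right))
            next_node += 2
        open_leaves = new_open
    return tree                                   # step 10: ship the DT to the manager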

2.7 Example Technique for Training a Random Forest

To train a Random Forest, the manager queries the tree builders in parallel. This query contains the index of the requested tree (the tree index is used in the seeding; see, e.g., section 2.3) as well as a list of splitters such that each column of the dataset is owned by at least one splitter. The answer from the tree builder is the decision tree.

2.8 Example Technique for Continuous Out-of-Bag Evaluation

The Out-Of-Bag (OOB) evaluation is the evaluation of a RF on the training dataset, such that each tree is only applied on samples excluded from its own bagging. OOB evaluation allows evaluation of a RF without a validation dataset. Continuously computing the OOB evaluation of a RF during training is an effective way to monitor the training and detect the convergence of the model.

During training, after the completion of each DT (or as specified by a walltime), the manager can send the new trees to a set of evaluators such that, together, the set of evaluators covers the entire dataset (e.g., the dataset is distributed row-wise among the evaluators). Each evaluator then estimates the OOB evaluation of the RF on its samples. By evaluating bag(i, p) on the fly, evaluators can detect if a particular sample i was used to train a particular tree p. The partial OOB evaluations are then sent back to and aggregated by the manager. The same method can be used to compute the importance of each feature.
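A minimal sketch of the evaluator-side computation follows, assuming a predict(tree, row) helper and the bag function of section 2.3; the majority-vote aggregation is an illustrative choice for classification.

def oob_error(rows, labels, trees, bag, predict):
    # A sample i is out-of-bag for tree p exactly when bag(i, p) == 0.
    errors = total = 0
    for i, row in enumerate(rows):
        votes = [predict(tree, row)
                 for p, tree in enumerate(trees) if bag(i, p) == 0]
        if not votes:              # sample was bagged by every tree so far
            continue
        prediction = max(set(votes), key=votes.count)   # majority vote
        errors += int(prediction != labels[i])
        total += 1
    return errors / total if total else None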

3. Example Complexity Analysis and Technical Effects and Benefits

U.S. Provisional Application No. 62/628,608 presents and compares in significant detail the theoretical complexities (memory, parallel time, I/O and network) of generic DT, generic RF, DRF, Sprint, Sliq, Sliq/R and Sliq/D. However, example technical effects and benefits of DRF and the main advantages of DRF over Sprint and Sliq/D-R are:

A smaller memory consumption per worker; e.g., compared to Sprint, DRF can reach, per worker, num records × (1 + log₂ max_(i)(num leaves at depth i)) bits, instead of num records × sizeof(record index), with sizeof(record index) equal to 64 bits for large datasets. Note: the memory consumption of DRF can be further reduced at the cost of an increase in time complexity.

A smaller amount and number of passes over data and of network communications. DRF's number of passes over data and network communications is proportional to the depth of the tree, while it is proportional to the number of nodes for Sprint, Sliq/D and Sliq/R. The total number of exchanged bits is also smaller for DRF. The network usage of Sliq/D is even greater since the node location of each sample is only known by one worker, and since all the workers need access to this information. DRF benefits from the communication-efficient synchronous sample bagging schema (see, e.g., section 2.3).

Further, in the case of a large dataset, the data can be distributed across several machines, work-centers, countries, and/or continents. The algorithms proposed herein work nicely in this situation (e.g., because of the small number of back-and-forth communications between the workers). This also means that splitters can be distributed to be as close as possible to their data.

The absence of need for disk writing during training. DRF only writes to disk during the initialization phase (unless the workers are configured to keep the dataset in memory, in which case there is no disk writing at all). In comparison, during training, Sprint writes to disk the equivalent of several times the training dataset—for each tree in the case of a forest.

All these algorithms operate differently, and benefit from different situations in terms of time complexity:

Sprint prunes records in closed leaves: a tree with a large number of records in shallow closed leaves is fast to train. However, Sprint scans and writes continuously both the candidate and non-candidate features, i.e. Sprint does not benefit from the small size of the set of candidate features.

Compared to Sprint, DRF benefits from records being in closed leaves differently: records in closed leaves are not pruned, but since Sliq and DRF only scan candidate features (i.e. features randomly chosen and not closed in earlier conditions), a smaller number of records leads to a smaller number of candidate features. Although our experiments focus on the classical case of features randomly drawn at each node, we point out that Sliq and DRF benefit greatly (by a factor proportional to the number of features) from limiting the number of unique candidate features at a given depth. In particular, the trend of using the same set of features for all nodes at a given depth leads to a fast DRF with a number of machines proportional to the number of randomly drawn features instead of the total number of features.

U.S. Provisional Application No. 62/628,608 also provides a study of the impact of equipping DRF with a mechanism to prune records similarly to Sprint: when DRF detects that this pruning becomes beneficial, the algorithm can prune the records in closed leaves. This operation is not triggered during the experimentation on the large dataset reported in U.S. Provisional Application No. 62/628,608.

4. Example Computing Systems and Devices

FIGS. 1-3C provide examples of computing systems and devices that can be used in accordance with aspects of the present disclosure. These computing systems and devices are provided as examples only. Many different systems, devices, and configurations thereof can be used to implement aspects of the present disclosure.

FIG. 1 depicts an exemplary distributed computing system 10 according to exemplary embodiments of the present disclosure. The architecture of the exemplary system 10 includes a single manager computing machine 12 (hereinafter “manager”) and multiple worker computing machines (e.g., worker computing machines 14, 16, and 18; hereinafter “workers”). Although only three workers 14-18 are illustrated, the system 10 can include any number of workers, including, for instance, hundreds of workers with thousands of cores.

The workers 14-18 can include machines configured to perform a number of different tasks. For example, the workers 14-18 can include tree builder machines, splitter machines, and/or evaluator machines.

Each of the manager computing machine 12 and the worker computing machines 14-18 can include one or more processing devices and a non-transitory computer-readable storage medium. The processing device can be a processor, microprocessor, or a component thereof (e.g., one or more cores of a processor). In some implementations, each of the manager computing machine 12 and the worker computing machines 14-18 can have multiple processing devices. For instance, a single worker computing machine can utilize or otherwise include plural cores of one or more processors.

The non-transitory computer-readable storage medium can include any form of computer storage device, including RAM (e.g., DRAM), ROM (e.g., EEPROM), optical storage, magnetic storage, flash storage, solid-state storage, hard drives, etc. The storage medium can store one or more sets of instructions that, when executed by the corresponding computing machine, cause the corresponding computing machine to perform operations consistent with the present disclosure. The storage medium can also store a cache of data (e.g., previously observed or computed data).

The manager computing machine 12 and the worker computing machines 14-18 can respectively communicate with each other over a network. The network can include a local area network, a wide area network, or some combination thereof. The network can include any number of wired or wireless connections. Communication across the network can occur using any number of protocols.

In some implementations, two or more of the manager computing machine 12 and the worker computing machines 14-18 can be implemented using a single physical device. For instance, two or more of the manager computing machine 12 and the worker computing machines 14-18 can be virtual machines that share or are otherwise implemented by a single physical machine (e.g., a single server computing device).

In one exemplary implementation, each of the manager computing machine 12 and the worker computing machines 14-18 is a component of a computing device (e.g., server computing device) included within a cloud computing environment/system.

According to an aspect of the present disclosure, the manager 12 can act as the orchestrator and can be responsible for assigning work, while the workers 14-18 can execute the computationally expensive parts of the algorithms described herein. Both the manager 12 and workers 14-18 can be multi-threaded to take advantage of multi-core parallelism.

In some implementations, the manager manages workers that include tree builders and evaluators. The manager is responsible for the fully trained trees. In some implementations, the manager does not have access to the dataset.

FIG. 2 shows an example arrangement of worker computing machines. In particular, as illustrated in FIG. 2, the worker machines can include several types of workers that are responsible for different operations. The splitter workers can look for optimal candidate splits. Each splitter can have access to a subset of dataset columns. The tree builder workers can hold the structure of one DT being trained (one DT per tree builder) and can coordinate the work of the splitters. In some implementations, tree builders do not have access to the training dataset. One tree builder can control several splitters, and one splitter can be controlled by several tree builders.

FIG. 3A depicts a block diagram of an example computing system 100 according to example embodiments of the present disclosure. The system 100 includes a user computing device 102, a server computing system 130, and a training computing system 150 that are communicatively coupled over a network 180.

The user computing device 102 can be any type of computing device, such as, for example, a personal computing device (e.g., laptop or desktop), a mobile computing device (e.g., smartphone or tablet), a gaming console or controller, a wearable computing device, an embedded computing device, or any other type of computing device.

The user computing device 102 includes one or more processors 112 and a memory 114. The one or more processors 112 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, a FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. The memory 114 can include one or more non-transitory computer-readable storage mediums, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. The memory 114 can store data 116 and instructions 118 which are executed by the processor 112 to cause the user computing device 102 to perform operations.

In some implementations, the user computing device 102 can store or include one or more machine-learned models 120. For example, the machine-learned models 120 can be or can otherwise include various machine-learned decision tree-based models such as, for example, classification and/or regression trees; iterative dichotomiser 3 decision trees; C4.5 decision trees; chi-squared automatic interaction detection decision trees; decision stumps; conditional decision trees; etc. Decision tree-based models can be boosted models, Random Forest models, or other types of models.

In some implementations, the one or more machine-learned models 120 can be received from the server computing system 130 over network 180, stored in the user computing device memory 114, and then used or otherwise implemented by the one or more processors 112. In some implementations, the user computing device 102 can implement multiple parallel instances of a single machine-learned model 120.

Additionally or alternatively, one or more machine-learned models 140 can be included in or otherwise stored and implemented by the server computing system 130 that communicates with the user computing device 102 according to a client-server relationship. For example, the machine-learned models 140 can be implemented by the server computing system 130 as a portion of a web service. Thus, one or more models 120 can be stored and implemented at the user computing device 102 and/or one or more models 140 can be stored and implemented at the server computing system 130.

The user computing device 102 can also include one or more user input components 122 that receive user input. For example, the user input component 122 can be a touch-sensitive component (e.g., a touch-sensitive display screen or a touch pad) that is sensitive to the touch of a user input object (e.g., a finger or a stylus). The touch-sensitive component can serve to implement a virtual keyboard. Other example user input components include a microphone, a traditional keyboard, or other means by which a user can provide user input.

The server computing system 130 includes one or more processors 132 and a memory 134. The one or more processors 132 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, a FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. The memory 134 can include one or more non-transitory computer-readable storage mediums, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. The memory 134 can store data 136 and instructions 138 which are executed by the processor 132 to cause the server computing system 130 to perform operations.

In some implementations, the server computing system 130 includes or is otherwise implemented by one or more server computing devices. In instances in which the server computing system 130 includes plural server computing devices, such server computing devices can operate according to sequential computing architectures, parallel computing architectures, or some combination thereof.

As described above, the server computing system 130 can store or otherwise include one or more machine-learned models 140. For example, the models 140 can be or can otherwise include various machine-learned decision tree-based models such as, for example, classification and/or regression trees; iterative dichotomiser 3 decision trees; C4.5 decision trees; chi-squared automatic interaction detection decision trees; decision stumps; conditional decision trees; etc. Decision tree-based models can be boosted models, Random Forest models, or other types of models.

The user computing device 102 and/or the server computing system 130 can train the models 120 and/or 140 via interaction with the training computing system 150 that is communicatively coupled over the network 180. The training computing system 150 can be separate from the server computing system 130 or can be a portion of the server computing system 130.

The training computing system 150 includes one or more processors 152 and a memory 154. The one or more processors 152 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, a FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. The memory 154 can include one or more non-transitory computer-readable storage mediums, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. The memory 154 can store data 156 and instructions 158 which are executed by the processor 152 to cause the training computing system 150 to perform operations. In some implementations, the training computing system 150 includes or is otherwise implemented by one or more server computing devices.

The training computing system 150 can include a model trainer 160 that trains the machine-learned models 120 and/or 140 stored at the user computing device 102 and/or the server computing system 130 using various training or learning techniques, such as, for example, any of the example training techniques described herein, including, for example, DRF or variants thereof. The model trainer 160 can perform a number of generalization techniques (e.g., weight decays, dropouts, etc.) to improve the generalization capability of the models being trained.

In particular, the model trainer 160 can train the machine-learned models 120 and/or 140 based on a set of training data 162. The training data 162 can include, for example, data descriptive of a plurality of samples (e.g., organized into rows: one sample per row), respective attribute values for a plurality of attributes for each of the plurality of samples (e.g., organized into columns: one attribute per column, with each row of the column providing an attribute value for the corresponding sample), and a plurality of labels respectively associated with the plurality of samples (e.g., a final column that contains the labels for the samples). The training computing system 150 can partition and distribute the training dataset such that each worker receives attribute data associated with one or more attributes.

In some implementations, if the user has provided consent, the training examples can be provided by the user computing device 102. Thus, in such implementations, the model 120 provided to the user computing device 102 can be trained by the training computing system 150 on user-specific data received from the user computing device 102. In some instances, this process can be referred to as personalizing the model.

In some implementations, the training computing system 150 can implement the model trainer 160 across or using multiple computing machines. For example, the model trainer 160 can take the form of the example systems illustrated in FIGS. 1 and 2.

The model trainer 160 includes computer logic utilized to provide desired functionality. The model trainer 160 can be implemented in hardware, firmware, and/or software controlling a general purpose processor. For example, in some implementations, the model trainer 160 includes program files stored on a storage device, loaded into a memory and executed by one or more processors. In other implementations, the model trainer 160 includes one or more sets of computer-executable instructions that are stored in a tangible computer-readable storage medium such as RAM, a hard disk, or optical or magnetic media.

The network 180 can be any type of communications network, such as a local area network (e.g., intranet), wide area network (e.g., Internet), or some combination thereof and can include any number of wired or wireless links. In general, communication over the network 180 can be carried via any type of wired and/or wireless connection, using a wide variety of communication protocols (e.g., TCP/IP, HTTP, SMTP, FTP), encodings or formats (e.g., HTML, XML), and/or protection schemes (e.g., VPN, secure HTTP, SSL).

FIG. 3A illustrates one example computing system that can be used to implement the present disclosure. Other computing systems can be used as well. For example, in some implementations, the user computing device 102 can include the model trainer 160 and the training dataset 162. In such implementations, the models 120 can be both trained and used locally at the user computing device 102. In some of such implementations, the user computing device 102 can implement the model trainer 160 to personalize the models 120 based on user-specific data.

FIG. 3B depicts a block diagram of an example computing device 40 that performs according to example embodiments of the present disclosure. The computing device 40 can be a user computing device or a server computing device.

The computing device 40 includes a number of applications (e.g., applications 1 through N). Each application contains its own machine learning library and machine-learned model(s). For example, each application can include a machine-learned model. Example applications include a text messaging application, an email application, a dictation application, a virtual keyboard application, a browser application, etc.

As illustrated in FIG. 3B, each application can communicate with a number of other components of the computing device, such as, for example, one or more sensors, a context manager, a device state component, and/or additional components. In some implementations, each application can communicate with each device component using an API (e.g., a public API). In some implementations, the API used by each application is specific to that application.

FIG. 3C depicts a block diagram of an example computing device 50 that performs according to example embodiments of the present disclosure. The computing device 50 can be a user computing device or a server computing device.

The computing device 50 includes a number of applications (e.g., applications 1 through N). Each application is in communication with a central intelligence layer. Example applications include a text messaging application, an email application, a dictation application, a virtual keyboard application, a browser application, etc. In some implementations, each application can communicate with the central intelligence layer (and model(s) stored therein) using an API (e.g., a common API across all applications).

The central intelligence layer includes a number of machine-learned models. For example, as illustrated in FIG. 3C, a respective machine-learned model can be provided for each application and managed by the central intelligence layer. In other implementations, two or more applications can share a single machine-learned model. For example, in some implementations, the central intelligence layer can provide a single model for all of the applications. In some implementations, the central intelligence layer is included within or otherwise implemented by an operating system of the computing device 50.

The central intelligence layer can communicate with a central device data layer. The central device data layer can be a centralized repository of data for the computing device 50. As illustrated in FIG. 3C, the central device data layer can communicate with a number of other components of the computing device, such as, for example, one or more sensors, a context manager, a device state component, and/or additional components. In some implementations, the central device data layer can communicate with each device component using an API (e.g., a private API).
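
As a purely illustrative sketch, the following Python class shows one way such a central intelligence layer might expose per-application and shared models behind a common API; the class and method names are hypothetical and are not drawn from the disclosure.

    # Hypothetical sketch of a central intelligence layer exposing a
    # common API to applications; all names below are illustrative.
    class CentralIntelligenceLayer:
        def __init__(self, shared_model=None):
            self._shared_model = shared_model   # optional model shared by all apps
            self._per_app_models = {}           # application name -> dedicated model

        def register_model(self, app_name, model):
            """Provide a respective model for a given application."""
            self._per_app_models[app_name] = model

        def predict(self, app_name, features):
            """Common API: route to the app's model, falling back to the shared one."""
            model = self._per_app_models.get(app_name, self._shared_model)
            if model is None:
                raise KeyError(f"No model available for application {app_name!r}")
            return model.predict(features)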

5. Additional Disclosure

The technology discussed herein makes reference to servers, databases, software applications, and other computer-based systems, as well as actions taken and information sent to and from such systems. The inherent flexibility of computer-based systems allows for a great variety of possible configurations, combinations, and divisions of tasks and functionality between and among components. For instance, processes discussed herein can be implemented using a single device or component or multiple devices or components working in combination. Databases and applications can be implemented on a single system or distributed across multiple systems. Distributed components can operate sequentially or in parallel.

While the present subject matter has been described in detail with respect to various specific example embodiments thereof, each example is provided by way of explanation, not limitation of the disclosure. Those skilled in the art, upon attaining an understanding of the foregoing, can readily produce alterations to, variations of, and equivalents to such embodiments. Accordingly, the subject disclosure does not preclude inclusion of such modifications, variations, and/or additions to the present subject matter as would be readily apparent to one of ordinary skill in the art. For instance, features illustrated or described as part of one embodiment can be used with another embodiment to yield a still further embodiment. Thus, it is intended that the present disclosure cover such alterations, variations, and equivalents.

What is claimed is:
1. A computer-implemented method comprising: distributing, by one or more computing devices, a training dataset to a plurality of workers on a per-attribute basis, such that each worker receives attribute data associated with one or more attributes; generating, by the one or more computing devices, one or more decision trees on a depth level-per-depth level basis, wherein generating the one or more decision trees comprises performing, by each worker at each of one or more depth levels, only a single pass over its corresponding attribute data to generate a plurality of proposed splits of the attribute data respectively for a plurality of live nodes; and providing, by the one or more computing devices, the one or more decision trees as an output.
2. The computer-implemented method of claim 1, wherein generating, by the one or more computing devices, one or more decision trees on a depth level-per-depth level basis comprises simultaneously generating, by the one or more computing devices, a plurality of decision trees on the depth level-per-depth level basis.
3. The computer-implemented method of claim 1, wherein performing, by each worker at each of the one or more depth levels, only the single pass over its corresponding attribute data to generate the plurality of proposed splits comprises performing, by each worker in parallel with all other workers, only the single pass over its corresponding attribute data to generate the plurality of proposed splits.
4. The computer-implemented method of claim 1, further comprising: partitioning, by the one or more computing devices, the training dataset into a plurality of shards, each shard containing one or more samples; and performing, by the one or more computing devices, out-of-bag evaluation of the one or more decision trees using the plurality of shards.
5. The computer-implemented method of claim 1, wherein performing, by each worker at each of the one or more depth levels, only the single pass over its corresponding attribute data to generate the plurality of proposed splits comprises performing, by each worker at each of the one or more depth levels, the single pass over its corresponding attribute data in a sequential fashion to generate the plurality of proposed splits.
6. The computer-implemented method of claim 1, wherein performing, by each worker at each of the one or more depth levels, only the single pass over its corresponding attribute data comprises: at each depth level: determining, by each worker, whether each sample included in the training dataset is associated with one or more of the plurality of live nodes at the depth level; and generating, by each worker, the proposed split for each of the plurality of live nodes at the depth level, wherein the proposed split for each live node is based on the attribute data associated with samples that were determined to be associated with such live node.
7. The computer-implemented method of claim 1, wherein performing, by each worker at each of the one or more depth levels, only the single pass over its corresponding attribute data comprises: at each depth level: determining, by each worker, whether each sample included in the training dataset is associated with one or more of the plurality of live nodes at the depth level; and for each sample associated with one or more of the live nodes, updating, by each worker, one or more counters respectively associated with the one or more live nodes with which such sample is associated based at least in part on one or more attribute values respectively associated with the one or more attributes associated with such worker.
8. The computer-implemented method of claim 7, wherein updating, by each worker, the one or more counters respectively associated with the one or more live nodes comprises updating, by each worker and for each live node, one or more bi-variate histograms between label values and attribute values respectively included in the one or more attributes associated with such worker.
9. The computer-implemented method of claim 7, wherein updating, by each worker, the one or more counters respectively associated with the one or more live nodes comprises sequentially and iteratively scoring, by each worker and for each live node, proposed numerical splits of the attribute values respectively included in the one or more attributes associated with such worker.
10. The computer-implemented method of claim 7, wherein determining, by each worker, whether each sample included in the training dataset is associated with one or more live nodes at the depth level comprises using, by each worker, a shared seed to evaluate a bagging of each sample with respect to the one or more decision trees.
11. The computer-implemented method of claim 7, wherein determining, by each worker, whether each sample included in the training dataset is associated with one or more live nodes at the depth level comprises consulting a mapping from sample index to node index.
12. The computer-implemented method of claim 1, wherein the plurality of live nodes are included in a plurality of different decision trees of the one or more decision trees, such that each worker generates proposed splits of its attribute data for live nodes included in the plurality of different decision trees.
13. The computer-implemented method of claim 1, wherein the plurality of live nodes are included in a single decision tree of the one or more decision trees, such that each worker generates proposed splits of its attribute data for live nodes included in the single decision tree.
14. The computer-implemented method of claim 1, wherein generating the one or more decision trees further comprises: performing, by each worker associated with a final split, a second pass over its corresponding attribute data to compute a bit condition associated with the final split.
15. A computer-implemented method, comprising: obtaining, by one or more computing devices, a training dataset comprising data descriptive of a plurality of samples, respective attribute values for a plurality of attributes for each of the plurality of samples, and a plurality of labels respectively associated with the plurality of samples; partitioning, by the one or more computing devices, the plurality of attributes into a plurality of attribute subsets, each attribute subset comprising one or more of the plurality of attributes; respectively assigning, by the one or more computing devices, the plurality of attribute subsets to a plurality of workers; and for each of a plurality of depth levels of a decision tree except a final level, each depth level comprising one or more nodes: for each of two or more of the plurality of attributes and in parallel: assessing, by the corresponding worker, the attribute value for each sample to update a respective counter associated with a respective node with which such sample is associated, wherein one or more counters are respectively associated with the one or more nodes at a current depth level; and identifying, by the corresponding worker, one or more proposed splits for the attribute respectively for the one or more nodes at the current depth level respectively based at least in part on the one or more counters respectively associated with the one or more nodes at the current depth level; and selecting, by the one or more computing devices, one or more final splits respectively for the one or more nodes at the current depth level from the one or more proposed splits identified by the plurality of workers.
16. The computer-implemented method of claim 15, wherein assessing, by the corresponding worker, the attribute value for each sample to update the respective counter associated with the respective node with which such sample is associated comprises: sequentially across all of the plurality of samples: determining, by the corresponding worker, whether the sample is associated with one of the one or more nodes at the current depth level; and when the sample is associated with one of the one or more nodes at the current depth level, assessing, by the corresponding worker, the attribute value for the sample to update the respective counter associated with the respective node with which such sample is associated.
17. The computer-implemented method of claim 15, further comprising: for each of the plurality of depth levels of the decision tree except the final depth level: generating, by the one or more computing devices, two or more child nodes for at least one of the one or more nodes at the current depth level; and updating, by the one or more computing devices, a mapping to assign at least one of the plurality of samples to the two or more child nodes, wherein the assignment of samples to child nodes is performed according to the final split selected for the node from which the child nodes depend.
18. The computer-implemented method of claim 15, further comprising performing said steps of assessing, identifying, and selecting for each depth level of a plurality of decision trees in parallel.
19. The computer-implemented method of claim 18, further comprising: providing, by the one or more computing devices, a plurality of random seeds to the plurality of workers, wherein the plurality of random seeds are respectively associated with the plurality of decision trees; and for each decision tree: for each of the plurality of depth levels of the decision tree except the final level and for each of the two or more of the plurality of attributes and in parallel: using, by the corresponding worker, the corresponding random seed to determine a respective number of instances that each sample is included in a tree-specific dataset associated with the decision tree.
20. The computer-implemented method of claim 15, further comprising: partitioning the training dataset into a plurality of shards, each shard containing one or more samples; and performing out-of-bag evaluation of the one or more decision trees using the plurality of shards.
21. A computing system comprising one or more computing devices configured to implement: a manager computing machine; and a plurality of worker computing machines coordinated by the manager computing machine, wherein the plurality of worker computing machines comprise: a plurality of splitter worker computing machines that have access to respective subsets of columns of a training dataset, wherein each of the splitter worker computing machines is configured to identify one or more proposed splits respectively for one or more attributes to which such splitter worker computing machine has access; and one or more tree builder worker computing machines respectively associated with one or more decision trees, wherein each of the one or more tree builder worker computing machines is configured to select a final split from the plurality of proposed splits identified by the plurality of splitter worker computing machines.
22. The computing system of claim 21, wherein the plurality of worker computing machines further comprise one or more out-of-bag evaluator workers that have access to respective shards of rows of the training dataset and compute an out-of-bag error.
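
For illustration only, and not as part of the claims, the following Python sketch shows one way a splitter worker might realize the single pass described above: one sequential scan of its attribute column builds a per-live-node bi-variate histogram between bucketized attribute values and labels, from which one numeric split per node is proposed. The function names, the use of information gain for scoring, and the requirement that candidate thresholds be sorted in ascending order are assumptions of the example.

    # Illustrative sketch only: single sequential pass of one splitter
    # worker over one attribute column at one depth level.
    import math
    from collections import defaultdict

    def single_pass_propose_splits(attr_values, labels, sample_to_node, thresholds):
        """attr_values[i], labels[i]: attribute value and label of sample i.
        sample_to_node[i]: the live node holding sample i, or None.
        thresholds: ascending list of candidate numeric split points.
        Returns a dict mapping each live node to (best_threshold, gain)."""
        # Bi-variate histogram per live node: counts[node][(bucket, label)],
        # where bucket counts how many thresholds the value exceeds.
        counts = defaultdict(lambda: defaultdict(int))
        for i, value in enumerate(attr_values):
            node = sample_to_node[i]
            if node is None:                       # sample not in any live node
                continue
            bucket = sum(1 for t in thresholds if value > t)
            counts[node][(bucket, labels[i])] += 1

        def entropy(label_counts):
            total = sum(label_counts.values())
            return -sum((c / total) * math.log2(c / total)
                        for c in label_counts.values() if c)

        proposals = {}
        for node, hist in counts.items():
            parent = defaultdict(int)              # label distribution at the node
            for (_, label), c in hist.items():
                parent[label] += c
            n = sum(parent.values())
            best_t, best_gain = None, -1.0
            for k, t in enumerate(thresholds):     # split: value <= t goes left
                left, right = defaultdict(int), defaultdict(int)
                for (bucket, label), c in hist.items():
                    (left if bucket <= k else right)[label] += c
                n_l, n_r = sum(left.values()), sum(right.values())
                if n_l == 0 or n_r == 0:
                    continue
                gain = (entropy(parent)
                        - (n_l / n) * entropy(left)
                        - (n_r / n) * entropy(right))
                if gain > best_gain:
                    best_t, best_gain = t, gain
            if best_t is not None:
                proposals[node] = (best_t, best_gain)
        return proposals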